The right programming language can make all the difference
in how easy it is to write a program. This is why a programmer's
arsenal holds not only general-purpose languages like C and its
relatives, but also programmable shells, scripting languages,
and lots of application-specific languages.

The power of good notation reaches beyond traditional programming
into specialized problem domains. HTML lets us create interactive
documents, often using embedded programs in other languages such as
JavaScript. PostScript expresses an entire document as a specialized
program. Spreadsheets and word processors often include languages to
evaluate expressions, access information, and control layout.

Regular expressions are one of the most broadly applicable
specialized languages, a compact and expressive notation for describing
patterns of text. Regular expressions are algorithmically interesting,
easy to implement in their simpler forms, and very useful.

Regular expressions come in several flavors. The so-called
“wildcards” used in command-line processors or shells to match
patterns of file names are a particularly simple example. Typically,
“*” is taken to mean “any string of characters,” so a command like


del *.exe

uses a pattern “*.exe” that matches all files with names that contain
any string followed by the literal string “.exe”.
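
The wildcard convention is simple enough to sketch in a few lines of C. The function below is an illustration only, not how any particular shell does it: the name wildmatch is hypothetical, only “*” is treated specially, and the pattern is assumed to have to match the whole file name.

```c
/* wildmatch: return 1 if name matches pat, where '*' in pat stands for
 * any string of characters (including the empty string), else 0.
 * Hypothetical helper for illustration; only '*' is special. */
int wildmatch(const char *pat, const char *name)
{
    if (*pat == '\0')
        return *name == '\0';       /* pattern used up: name must be too */
    if (*pat == '*') {
        do {                        /* let '*' absorb 0, 1, 2, ... characters */
            if (wildmatch(pat + 1, name))
                return 1;
        } while (*name++ != '\0');
        return 0;
    }
    /* ordinary character: must match exactly, then continue */
    return *name == *pat && wildmatch(pat + 1, name + 1);
}
```

With this sketch, wildmatch("*.exe", "prog.exe") returns 1 and wildmatch("*.exe", "prog.txt") returns 0.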

Regular expressions pervade UNIX, in editors, tools like grep,
and scripting languages like Awk, Perl, and Tcl. Although the
variations among different programs may suggest that regular
expressions are an ad hoc mechanism, they are, in fact, a language
in a strong technical sense: a formal grammar specifies
their structure and a precise meaning can be attached to each
utterance in the language. Furthermore, the right implementation
can run very fast; a combination of theory and engineering
practice pays off handsomely.


The Language of Regular Expressions 

A regular expression is a sequence of characters that defines
a pattern. Most characters in the pattern simply match themselves
in a target string, so the regular expression “abc” matches
that sequence of three letters wherever it occurs in the target.
A few characters are used in patterns as metacharacters
to indicate repetition, grouping, or positioning. In POSIX regular
expressions, “^” stands for the beginning of a string and “$”
for the end, so “^x” matches an “x” only at the beginning of a
string, “x$” matches an “x” only at the end, “^x$” matches “x”
only if it is the sole character of the string, and “^$” matches the
empty string.

The character “.” (a period) matches any character, so “x.y” matches
“xay,” “x2y,” and so on, but not “xy” or “xyxy.” The regular
expression “^.$” matches a string that contains any single character.

A set of characters inside brackets “[]” matches any single one
of the enclosed characters; for example, “[0123456789]” matches
a single digit. This pattern may be abbreviated “[0-9].”

These building blocks are combined with parentheses for
grouping, “|” for alternatives, “*” for zero or more occurrences,
“+” for one or more occurrences, and “?” for zero or one
occurrences. Finally, “\” is used as a prefix to quote a
metacharacter and turn off its special meaning.

These can be combined into remarkably rich patterns. For example,
“\.[0-9]+” matches a period followed by one or more digits;
“[0-9]+\.?[0-9]*” matches one or more digits followed by an
optional period and zero or more further digits; “(\+|-)” matches
a plus or a minus (“\+” is a literal plus sign); and “[eE](\+|-)?[0-9]+”
matches an “e” or “E” followed by an optional sign and one or more
digits. These are combined in the following pattern that matches
floating-point numbers:


(\+|-)?([0-9]+\.?[0-9]*|\.[0-9]+)([eE](\+|-)?[0-9]+)?
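
One way to try the floating-point pattern is through the POSIX regular expression library. The sketch below assumes a POSIX system that provides <regex.h>; the function name is_float is hypothetical, and the surrounding “^” and “$” are additions made here so that the whole string, not just part of it, must be a number.

```c
#include <regex.h>
#include <stddef.h>

/* The article's pattern, anchored at both ends (the anchors are an
 * assumption of this sketch). Backslashes are doubled for C strings. */
#define FLOAT_RE \
    "^(\\+|-)?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE](\\+|-)?[0-9]+)?$"

/* is_float: return 1 if s is entirely a floating-point number,
 * 0 if it is not, and -1 if the pattern fails to compile. */
int is_float(const char *s)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED selects the syntax with +, ?, | and ( ) as operators */
    if (regcomp(&re, FLOAT_RE, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Compiling the pattern on every call keeps the sketch short; a program that tests many strings would compile once and reuse the regex_t.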












Example 1: The function match determines whether a
string matches a regular expression.


Example 2: The recursive function matchhere does most 
of the work. 


Example 3: The function matchstar is called when the 
expression begins with a starred character. 

A Regular Expression Search Function

Some systems include a regular expression library, usually called
“regex” or “regexp.” However, if this is not available, it's easy to
implement a modest subset of the full regular expression language.
The regular expressions we present here make use of four
metacharacters: “^,” “$,” “.,” and “*,” with “*” specifying
zero or more occurrences of the preceding period or literal character.
This provides a large fraction of the power of general regular
expressions with a tiny fraction of the implementation complexity.
We'll use these functions to implement a small but
eminently useful version of grep (available electronically; see
“Resource Center,” page 5).

In Example 1, the function match determines whether a string
matches a regular expression. If the regular expression begins
with “^,” the text must begin with a match of the remainder of
the expression. Otherwise, we walk along the text, using matchhere
to see if the text matches at each position in turn. As soon
as we find a match, we're done. Expressions that contain “*”
can match the empty string (for example, “.*y” matches “y”
among many other things), so we must call matchhere even if
the text is empty.

In Example 2, the recursive function matchhere does most of
the work. If the regular expression is empty, we have reached
the end and thus have found a match. If the regular expression
ends with “$,” it matches only if the text is also at the end. If
the regular expression begins with a period, it matches any character.
Otherwise, the expression begins with a plain character
that matches itself in the text. A “^” or “$” that appears in the
middle of a regular expression is thus taken as a literal character,
not a metacharacter.

Notice that matchhere calls itself after matching one character
of pattern and string. Thus the depth of recursion can be as
much as the length of the pattern.

The one tricky case occurs when the expression begins with
a starred character, “x*,” for example. Then we call matchstar
with three arguments: the operand of the star (x), the pattern
after the star, and the text; see Example 3. Again, a starred
regular expression can match zero characters. The loop checks
whether the text matches the remaining expression, trying at
each position of the text as long as the first character matches
the operand of the star.

Our implementation is admittedly unsophisticated, but it works.
And, at fewer than 30 lines of code, it shows that regular
expressions don’t need advanced techniques to be put to use.
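
The three functions described in this section can be sketched together in C as follows. The listings for Examples 1 through 3 do not appear in this text, so the code is written to follow the prose description; the exact signatures and the use of const are assumptions.

```c
static int matchhere(const char *regexp, const char *text);
static int matchstar(int c, const char *regexp, const char *text);

/* match: search for regexp anywhere in text */
int match(const char *regexp, const char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp + 1, text);
    do {    /* must look even if the text is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at the beginning of text */
static int matchhere(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return 1;                       /* end of pattern: success */
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp + 2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';           /* $ at the end of the pattern */
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp + 1, text + 1);
    return 0;
}

/* matchstar: search for c*regexp at the beginning of text */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances of c */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

For instance, match("^x$", "x") and match(".*y", "y") both return 1, while match("x.y", "xy") returns 0.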


Grep 

The pattern-matching program grep, invented by Ken Thompson
(the father of UNIX), is a marvelous example of the value
of notation. It applies a regular expression to each line of
its input files and prints those lines that contain matching strings.
This simple specification, plus the power of regular expressions,
lets it solve many day-to-day tasks. In the following examples,
note that the regular expression used as the argument
to grep is different from the wildcard pattern used to specify
file names.

• Search for a name in a set of source files: grep fprintf *.c

• Search for a phrase in a set of text files: grep 'regular expression' *.txt

• Filter output from some other program, for example to print
all error messages: gcc *.c | grep Error:

• Filter input to some other program, for example to count
nonblank lines: grep . *.cpp | wordcount





With flags to print line numbers of matched lines, count matches,
do case-insensitive matching, select lines that don’t match
the pattern, and other variations of the basic idea, grep is so
widely used that it has become the classic example of tool-based
programming.

Example 4 is the main routine of an implementation of grep
that uses match. It is conventional that UNIX programs return 0
for success and nonzero values for various failures. Our grep,
like the UNIX version, defines success as finding a matching
line, so it returns 0 if there were any matches, 1 if there were
none, and 2 if an error occurred. These status values can be
tested by other programs like a shell.

In Example 5, the function grep scans a single file, calling
match on each line. The code is mostly straightforward, but two
details deserve comment. First, the main routine doesn't
quit if it fails to open a file. This is because it's common to say
something like


% grep herpolhode *.*


and find that one of the files in the directory can't be read. It's
better for grep to keep going after reporting the problem, rather
than to give up and force users to type the file list manually to
avoid the problem file. Second, grep prints the matching line
and the file name, but suppresses the file name if it is reading
standard input or a single file. This may seem an odd design,
but it reflects a style of use based on experience. When given
only one input, grep’s task is usually selection, and the file name
would clutter the output. But if it is asked to search through
many files, the task is most often to find all occurrences of
something, and the file names are helpful. Compare

% strings enormous.dll | grep Error: 

with 

% grep grammer *.txt
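
The two design points above (keep going after a file that can't be opened, and print file names only when there are several files) can be sketched in C as follows. This is an illustration under stated assumptions, not the listing from Examples 4 and 5: the names Matcher, substring_match, grep_file, and grep_run are hypothetical, the matcher is passed in as a function pointer so the sketch stands alone, and a plain substring test stands in for a regular expression matcher. How an unreadable file should affect the exit status is also an assumption here; only a usage error returns 2.

```c
#include <stdio.h>
#include <string.h>

/* Matcher: any function that says whether text matches pattern */
typedef int (*Matcher)(const char *pattern, const char *text);

/* substring_match: stand-in matcher, an assumption of this sketch */
int substring_match(const char *pattern, const char *text)
{
    return strstr(text, pattern) != NULL;
}

/* grep_file: print matching lines of f; name == NULL suppresses the
 * file-name prefix. Returns the number of matching lines.
 * Lines longer than the buffer are treated as several lines. */
int grep_file(Matcher m, const char *pattern, FILE *f,
              const char *name, FILE *out)
{
    char buf[BUFSIZ];
    int nmatch = 0;
    size_t n;

    while (fgets(buf, sizeof buf, f) != NULL) {
        n = strlen(buf);
        if (n > 0 && buf[n - 1] == '\n')
            buf[n - 1] = '\0';          /* strip the newline */
        if (m(pattern, buf)) {
            nmatch++;
            if (name != NULL)
                fprintf(out, "%s:", name);
            fprintf(out, "%s\n", buf);
        }
    }
    return nmatch;
}

/* grep_run: plays the role of main; argv[1] is the pattern and the
 * remaining arguments are file names. Returns the exit status. */
int grep_run(Matcher m, int argc, char *argv[], FILE *out)
{
    int i, nmatch = 0;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: grep pattern [file ...]\n");
        return 2;                       /* error */
    }
    if (argc == 2)                      /* no files: read standard input */
        nmatch = grep_file(m, argv[1], stdin, NULL, out);
    for (i = 2; i < argc; i++) {
        f = fopen(argv[i], "r");
        if (f == NULL) {                /* report the problem, keep going */
            fprintf(stderr, "grep: can't open %s\n", argv[i]);
            continue;
        }
        /* show file names only when more than one file was given */
        nmatch += grep_file(m, argv[1], f, argc > 3 ? argv[i] : NULL, out);
        fclose(f);
    }
    return nmatch > 0 ? 0 : 1;          /* 0: match found, 1: none */
}
```

A program built on this sketch would have main return grep_run(substring_match, argc, argv, stdout), or pass a regular expression matcher with the same signature instead.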

Example 4: Main routine of an implementation of grep
that uses match. (Assumes command interpreter expands
wild cards in file specification on the command line into
lists of file names. UNIX shells exhibit this behavior,
although MS-DOS COMMAND.COM does not.)

Example 5: The function grep scans a single file, calling
match on each line.

Example 6: A version of matchstar that does leftmost
longest matching.

Our implementation of match returns as soon as it finds a
match. For grep, that is a fine default. But, for implementing a
substitution (search-and-replace) operator in a text editor, the
leftmost longest match is more useful. For example, given the
text “aaaaa,” the pattern “a*” matches the null string at the
beginning of the text, but the user probably intended to match
all five characters. To cause match to find the leftmost longest
string, matchstar must be rewritten to be greedy: rather than
looking at each character of the text from left to right, it should
skip over the longest string that matches the starred operand,
then back up if the rest of the string doesn't match the rest of
the pattern. In other words, it should run from right to left.
Example 6 is a version of matchstar that does leftmost longest
matching. This might be the wrong version of matchstar for
grep, because it does extra work; but for a substitution operator,
it is essential.


What Next?

Our grep is competitive with system-supplied versions, regardless
of the regular expression. For example, it takes about
six seconds to search a 40-MB text file on a 400-MHz Pentium
(compiled with Visual C++). Pathological expressions can cause
exponential behavior, such as “a*a*a*a*a*b” when given the
input “aaaaaaaaac,” but the exponential behavior exists in
many commercial implementations, too. A more sophisticated
matching algorithm can guarantee linear performance by
avoiding backtracking when a partial match fails; the UNIX
egrep program implements such an algorithm, as do scripting
languages.
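
The leftmost-longest matchstar discussed just before this section is itself a backtracking matcher, and a sketch of it in C follows. It is not the Example 6 listing: the names matchhere_end, matchstar_end, and match_longest are hypothetical, and the functions report where the match ends (rather than 0 or 1) on the assumption that a substitution operator needs the extent of the match. Each star is greedy, which gives the longest match for a starred operand but is not claimed to be longest for every possible pattern.

```c
#include <stddef.h>

static const char *matchstar_end(int c, const char *re, const char *text);

/* matchhere_end: if re matches at the start of text, return a pointer
 * just past the matched text; otherwise return NULL */
static const char *matchhere_end(const char *re, const char *text)
{
    if (re[0] == '\0')
        return text;                    /* pattern used up: match ends here */
    if (re[1] == '*')
        return matchstar_end(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0' ? text : NULL;
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere_end(re + 1, text + 1);
    return NULL;
}

/* matchstar_end: greedy search for c*re at the start of text */
static const char *matchstar_end(int c, const char *re, const char *text)
{
    const char *t, *end;

    /* skip over the longest run of characters that match c */
    for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
        ;
    for (;;) {                          /* then back up, right to left */
        if ((end = matchhere_end(re, t)) != NULL)
            return end;
        if (t == text)
            return NULL;
        t--;
    }
}

/* match_longest: find the leftmost match of re in text; store where it
 * starts in *start and return its length, or return -1 if none */
long match_longest(const char *re, const char *text, const char **start)
{
    const char *end;

    if (re[0] == '^') {
        if ((end = matchhere_end(re + 1, text)) == NULL)
            return -1;
        *start = text;
        return (long)(end - text);
    }
    do {
        if ((end = matchhere_end(re, text)) != NULL) {
            *start = text;
            return (long)(end - text);
        }
    } while (*text++ != '\0');
    return -1;
}
```

Given the text “aaaaa” and the pattern “a*”, match_longest reports a match of length 5 starting at the first character, rather than the empty match at the beginning.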


Obvious extensions to these regular expressions would include
character classes like “[a-zA-Z]” to match a single alphabetic
character; the ability to quote a metacharacter (for example, to
search for a literal period); parentheses like “(abc)*” for grouping;
and alternatives, where “abc|def” matches “abc” or “def.”

The first step is to help match by compiling the pattern into
a representation that is easier to scan. It is expensive to parse a
character class every time we compare it against a character; a
precomputed representation based on bit vectors could make
character classes much more efficient.
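
A bit-vector representation of a character class might look like the following sketch. The type CharClass and the functions cc_compile and cc_has are hypothetical names, and the class syntax handled is an assumption: the text between the brackets, with “lo-hi” ranges and no negation or quoting.

```c
#include <limits.h>
#include <string.h>

#define NCHARS (UCHAR_MAX + 1)

/* CharClass: one bit per possible character value */
typedef struct {
    unsigned char bits[(NCHARS + CHAR_BIT - 1) / CHAR_BIT];
} CharClass;

/* cc_add: put character c into the class */
static void cc_add(CharClass *cc, unsigned char c)
{
    cc->bits[c / CHAR_BIT] |= (unsigned char)(1u << (c % CHAR_BIT));
}

/* cc_has: return 1 if c is in the class; one index, shift, and mask */
int cc_has(const CharClass *cc, unsigned char c)
{
    return (cc->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}

/* cc_compile: build the class once from the text between the
 * brackets, for example "a-zA-Z" or "0-9_" */
void cc_compile(CharClass *cc, const char *spec)
{
    const unsigned char *s = (const unsigned char *)spec;
    unsigned c;

    memset(cc, 0, sizeof *cc);
    while (*s != '\0') {
        if (s[1] == '-' && s[2] != '\0') {      /* a range lo-hi */
            for (c = s[0]; c <= s[2]; c++)
                cc_add(cc, (unsigned char)c);
            s += 3;
        } else {                                /* a single character */
            cc_add(cc, *s++);
        }
    }
}
```

The parsing cost is paid once in cc_compile; after that, each comparison against a text character is a constant-time bit test in cc_has.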

For regular expressions with parentheses and alternatives,
the implementation must be more sophisticated. One approach
is to compile the regular expression into a parse tree that captures
its grammatical structure. This tree is then traversed to
create a state machine: a table of states, each of which gives
the next state for each possible input character. The string is
scanned by the state machine, which reports when it reaches
a state corresponding to a match of the pattern. Another
approach is similar to what is done in just-in-time compilers:
The regular expression is compiled into instructions that will
scan the string; the state machine is implicit in the generated
instructions.
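
To make the table-of-states idea concrete, here is a small sketch in C of a state machine for the pattern “[0-9]+\.?[0-9]*” used earlier. The table is filled in by hand rather than generated from a parse tree, the names dfa_init and dfa_match are hypothetical, and the machine is assumed to match the whole string rather than search within it.

```c
#include <limits.h>

/* States for [0-9]+\.?[0-9]* ; DEAD means no match is possible */
enum { START, INT_PART, FRAC_PART, NSTATES };
#define DEAD (-1)

/* next_state[s][c]: the state reached from state s on input character c */
static int next_state[NSTATES][UCHAR_MAX + 1];

/* accepting[s]: nonzero if ending in state s counts as a match */
static const int accepting[NSTATES] = { 0, 1, 1 };

/* dfa_init: fill in the table; must be called once before dfa_match */
void dfa_init(void)
{
    int s, c;

    for (s = 0; s < NSTATES; s++)
        for (c = 0; c <= UCHAR_MAX; c++)
            next_state[s][c] = DEAD;
    for (c = '0'; c <= '9'; c++) {
        next_state[START][c] = INT_PART;        /* first digit */
        next_state[INT_PART][c] = INT_PART;     /* more digits */
        next_state[FRAC_PART][c] = FRAC_PART;   /* digits after the period */
    }
    next_state[INT_PART]['.'] = FRAC_PART;      /* the optional period */
}

/* dfa_match: return 1 if all of text matches the pattern, else 0 */
int dfa_match(const char *text)
{
    const unsigned char *p;
    int state = START;

    for (p = (const unsigned char *)text; *p != '\0'; p++) {
        state = next_state[state][*p];          /* one lookup per character */
        if (state == DEAD)
            return 0;
    }
    return accepting[state];
}
```

Each input character costs exactly one table lookup and the scan never backs up, which is the property that makes linear-time matching possible.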


Further Reading 
J.E.F. Friedl's Mastering Regular Expressions (O'Reilly & Associates,
1997) is an extensive treatment of the subject. Regular
expressions are one of the most important features of
some scripting languages; see The AWK Programming Language
by A.V. Aho, B.W. Kernighan, and P.J. Weinberger (Addison-Wesley,
1988) and Programming Perl, by Larry Wall,
Tom Christiansen, and Randal L. Schwartz (O'Reilly & Associates,
1996).